This document is a walkthrough of how to use the code for the Turnout Tracker. For a discussion of the math and the model, see Turnout Tracker Math.
There are a number of parameters that need to be fit on historical data: baseline turnout rates, precinct covariances, etc.
Each election should have a config, which I’ve created in config.R. config is a list with the following items:
source("util_tracker.R")
election_dir <- "elections/phila_201911"
add_election_path <- function(file) sprintf("%s/%s", election_dir, file)
library(tidyverse)
source(add_election_path("config.R"))
print(config)
## $city
## [1] "Philadelphia"
##
## $city_filename
## [1] "philadelphia"
##
## $timezone
## [1] "America/New_York"
##
## $election_ds
## [1] "2019-11-05"
##
## $start_hour
## [1] 7
##
## $end_hour
## [1] 20
##
## $precinct_shp_path
## [1] "data/Political_Divisions.shp"
##
## $get_precinct_id
## function (df)
## df$DIVISION_N
##
## $get_year_from_election
## function (election)
## as.numeric(substr(election, 1, 4))
##
## $get_etype_from_election
## function (election)
## substr(election, 6, nchar(as.character(election)))
##
## $get_ward_from_precinct
## function (precinct)
## substr(precinct, 1, 2)
##
## $turnout_df_path
## [1] "data/phila_turnout.csv"
##
## $submission_bitly
## [1] "http://bit.ly/sixtysixturnout"
##
## $google_doc
## [1] "docs.google.com/spreadsheets/d/1GCPVCim0T5Kt4Qveotibx8pDyR2ZPVlCjpUFAMPy9F4"
##
## $test_data
## [1] "docs.google.com/spreadsheets/d/11g0yRMF6VQqfZCHOC2S2jL62FgdiKVMq7KzUkcPhy1w"
##
## $ref_turnout
## 2015 2011
## 238664 200834
##
## $site_name
## [1] "Sixty-Six Wards"
##
## $site_link
## [1] "https://sixtysixwards.com/"
##
## $precinct_name
## [1] "division"
##
## $ward_name
## [1] "ward"
##
## $map_legend_pos
## [1] 0.7 0.1
The helper function prep_shapefile will load the shapefiles, process them, and then save sf objects precincts.Rds and wards.Rds in data.
source("prep_shapefiles.R")
prep_shapefile(
add_election_path(config$precinct_shp_path),
config$get_precinct_id,
config$get_ward_from_precinct,
save_dir = add_election_path("data")
)
## Reading layer `Political_Divisions' from data source `C:\Users\Jonathan Tannen\Dropbox\sixty_six\posts\turnout_tracker\tracker_v0\elections\phila_201911\data\Political_Divisions.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1703 features and 5 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -75.28031 ymin: 39.86748 xmax: -74.95574 ymax: 40.13793
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
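The ward shapes are produced by dissolving the precinct polygons on their ward id. The toy sketch below is an illustration of that dissolve pattern with made-up unit squares, not prep_shapefile itself; in sf, summarise() unions the geometries within each group:

```r
library(sf)
library(dplyr)

# Two made-up unit-square precincts that share a ward id
sq <- function(x0) st_polygon(list(cbind(c(x0, x0 + 1, x0 + 1, x0, x0),
                                         c(0, 0, 1, 1, 0))))
precincts <- st_sf(
  precinct = c("0101", "0102"),
  ward     = c("01", "01"),
  geometry = st_sfc(sq(0), sq(1))
)

# Dissolve: grouping by ward and calling summarise() unions the geometries
wards <- precincts %>% group_by(ward) %>% summarise()
```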
Before election day, we need to calculate the historic fixed effects and correlations. All of the prep work is done in calc_params. The input is a dataframe, turnout_df, which has columns precinct, election, and turnout. precinct is the unique identifier for the precinct, election identifies the election, and turnout is the voter count. You will need to crosswalk to the present-day precincts if boundaries have changed.
df <- read_csv(add_election_path(config$turnout_df_path), col_types = "ccd")
head(arrange(df, precinct, election))
## # A tibble: 6 x 3
## precinct election turnout
## <chr> <chr> <dbl>
## 1 0101 2002 general 185
## 2 0101 2002 primary 141
## 3 0101 2003 general 213
## 4 0101 2003 primary 30
## 5 0101 2004 general 311
## 6 0101 2004 primary 98
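When boundaries have changed, one common approach is a weighted crosswalk that splits each old precinct's votes across the current precincts it overlaps. The sketch below is hypothetical: the crosswalk table, its column names, and the weights are all made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical crosswalk: each old precinct's turnout is split across the
# current precincts it overlaps, with weights summing to 1 per old precinct.
crosswalk <- tribble(
  ~old_precinct, ~precinct, ~weight,
  "0101",        "0101",    1.0,
  "0102",        "0102",    0.6,  # pretend 0102 was split into 0102 and 0109
  "0102",        "0109",    0.4
)

historic <- tribble(
  ~old_precinct, ~election,      ~turnout,
  "0101",        "2002 general", 185,
  "0102",        "2002 general", 300
)

# Allocate the old counts to present-day precincts
turnout_df <- historic %>%
  inner_join(crosswalk, by = "old_precinct") %>%
  group_by(precinct, election) %>%
  summarise(turnout = sum(turnout * weight), .groups = "drop")
```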
We can now calculate the historic parameters with calc_params:
source("precalc_params.R")
params <- calc_params(
turnout_df=df,
n_svd=3
)
## [1] "Fitting fixed effects"
## [1] "Calculating svd"
## [1] "Fitted vs True values, check for similarity:"
## [1] "Fitted:"
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.23595535 -0.279344484 -0.3070223 -0.6288146 -0.2514875 -0.09150797
## [2,] -0.22811672 -0.265536131 -0.2968610 -0.5935173 -0.2514009 -0.08536257
## [3,] -0.19215436 -0.186998666 -0.2671716 -0.4281893 -0.3302952 -0.01147875
## [4,] -0.17547762 -0.250561089 -0.2557691 -0.6782176 -0.1941783 -0.03904310
## [5,] -0.00709346 0.004097716 -0.0654225 -0.1291420 -0.2025106 0.12357970
## [6,] -0.11235253 -0.117818969 -0.1448739 -0.2472072 -0.1430513 -0.03618946
## [1] "True:"
## 2002 general 2002 primary 2003 general 2003 primary 2004 general
## 1 -0.166995789 -0.17009508 -0.16509206 -0.6447489 -0.19832982
## 2 -0.301279819 -0.34155965 -0.25568259 -0.6112395 -0.24198783
## 3 -0.208756269 -0.09552181 -0.22291624 -0.3770327 -0.32402475
## 4 -0.074715297 -0.06527169 -0.03715024 -0.8556547 -0.23334492
## 5 0.015616245 0.01463587 -0.04731135 -0.1188036 -0.06146578
## 6 -0.009966313 0.03661322 -0.19591997 0.6108928 -0.04656912
## 2004 primary
## 1 0.154441432
## 2 0.031745017
## 3 0.066967713
## 4 0.008014998
## 5 0.089799005
## 6 0.448563604
## [1] "Calculating covariances"
## params has a copy of turnout_df, with some new columns.
print(head(params@turnout_df))
## # A tibble: 6 x 5
## precinct election turnout log_turnout precinct_num
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 0101 2002 general 185 5.23 1
## 2 0101 2002 primary 141 4.96 1
## 3 0101 2003 general 213 5.37 1
## 4 0101 2003 primary 30 3.43 1
## 5 0101 2004 general 311 5.74 1
## 6 0101 2004 primary 98 4.60 1
## params has an estimate of the election_fe, on the log scale.
print(head(params@election_fe))
## # A tibble: 6 x 2
## election election_fe
## <chr> <dbl>
## 1 2002 general 5.39
## 2 2002 primary 5.12
## 3 2003 general 5.53
## 4 2003 primary 4.07
## 5 2004 general 5.94
## 6 2004 primary 4.44
## params has an estimate of the precinct_fe, on the log scale.
print(head(params@precinct_fe))
## # A tibble: 6 x 2
## precinct precinct_fe
## <fct> <dbl>
## 1 0101 0.00488
## 2 0102 0.251
## 3 0103 0.508
## 4 0104 0.248
## 5 0105 -0.291
## 6 0106 -0.230
## params has the svd results, which are used for the covariance (more on this later).
print(head(params@svd$u))
## [,1] [,2] [,3]
## [1,] -0.03703140 -0.010902274 -0.02877904
## [2,] -0.03604455 -0.009483524 -0.02708452
## [3,] -0.03636111 0.005530277 -0.02155213
## [4,] -0.03190417 -0.009887940 -0.03619283
## [5,] -0.01523150 0.019150560 -0.01457941
## [6,] -0.01810269 -0.002073228 -0.01081952
print(head(params@svd$v))
## [,1] [,2] [,3]
## [1,] 0.2116287 0.11211068 -0.11009361
## [2,] 0.1593407 0.16341858 0.02743948
## [3,] 0.2371900 0.07910834 -0.03714818
## [4,] 0.1696416 0.21855651 0.49592257
## [5,] 0.2725468 -0.14067139 -0.07568843
## [6,] 0.1013064 0.20417139 -0.16007275
## params has the estimated covariance matrix among precincts (and its inverse)
print(params@precinct_cov[1:6, 1:6])
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.08813260 0.06308647 0.05620049 0.06204101 0.02086924 0.02953503
## [2,] 0.06308647 0.08319649 0.05455631 0.05951960 0.02071391 0.02851708
## [3,] 0.05620049 0.05455631 0.07791732 0.05234524 0.02783795 0.02684974
## [4,] 0.06204101 0.05951960 0.05234524 0.08282559 0.02041623 0.02748571
## [5,] 0.02086924 0.02071391 0.02783795 0.02041623 0.04475905 0.01122282
## [6,] 0.02953503 0.02851708 0.02684974 0.02748571 0.01122282 0.03617935
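The pieces stored in params can be illustrated on a toy matrix. The base-R sketch below is a simplified stand-in for what calc_params does, not the actual implementation: fit additive election and precinct effects on the log scale, take a low-rank SVD of the residuals, and build a precinct covariance from the low-rank fit.

```r
set.seed(1)
# Toy log-turnout matrix: rows = precincts, columns = elections
log_turnout <- matrix(5 + rnorm(6 * 4, sd = 0.3), nrow = 6, ncol = 4)

# Additive fixed effects via a simple two-way means decomposition
election_fe <- colMeans(log_turnout)
precinct_fe <- rowMeans(log_turnout) - mean(log_turnout)
resid <- sweep(sweep(log_turnout, 2, election_fe), 1, precinct_fe)

# Low-rank SVD of the residuals, analogous to calc_params(n_svd = 3)
s <- svd(resid, nu = 3, nv = 3)
fitted <- s$u %*% diag(s$d[1:3]) %*% t(s$v)  # compare to resid, as the
                                             # "Fitted vs True" printout does

# Precinct-by-precinct covariance implied by the low-rank structure
precinct_cov <- fitted %*% t(fitted) / (ncol(fitted) - 1)
```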
I also provide some helper functions to make diagnostic plots. These require an sf object with the precinct shapefiles (the outputs of prep_shapefile suffice).
The diagnostics include plots of (a) the fixed effects by precinct and by year, and (b) the svd components for the estimated covariances, along with each dimension's score in each year. You should sanity-check that the combination of precincts and elections makes sense.
library(sf)
## Finally, check out some diagnostic plots...
divs <- readRDS(add_election_path("data/precincts.Rds"))
diagnostics(params, divs, config, pause=FALSE)
## [1] "Plotting Diagnostics..."
The plots look good. Dimension 1 is blue for Hispanic North Philly and the University of Pennsylvania, and the line plot shows that these precincts had disproportionately high turnout in 2004, 2008, 2012, and 2016 (the presidential elections). Dimension 2 has captured population change (red divisions are growing, blue divisions are shrinking). Dimension 3 is hard to interpret, and may be noise or misfitting.
Let’s save the results and move on.
save_with_backup(params, stem="params", dir=add_election_path("outputs"))
An important validation is to test the model on a fake, known distribution. The function load_data will either load data from our Google Form download (more on that later) or create a fake dataset with an S-curve.
source("fit_submissions.R")
set.seed(215)
data_list <- load_data(use_google_data=FALSE, params=params, election_config=config)
## [1] "True Turnout"
## [1] 103336.8
raw_data <- data_list$raw_data
fake_data <- data_list$fake_data
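For intuition, here is a hypothetical version of the S-curve simulation behind the fake dataset. The logistic location and scale, the time grid, and the final turnout are made-up values for illustration, not load_data's actual defaults:

```r
# Hypothetical S-curve: cumulative share of votes cast over the day,
# from config$start_hour (7am) to config$end_hour (8pm)
hours <- seq(7, 20, by = 0.25)
s_curve <- plogis(hours, location = 13, scale = 2)  # logistic S-shape
s_curve <- (s_curve - min(s_curve)) / (max(s_curve) - min(s_curve))

true_turnout <- 103337  # an assumed final citywide turnout
fake_obs <- data.frame(
  time_of_day = hours,
  cum_turnout = true_turnout * s_curve  # cumulative votes by each time
)
```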
em_fit <- fit_em_model(
raw_data, params, verbose=FALSE, tol=1e-10, use_inverse=FALSE, election_config = config
)
## [1] "n_iter = 72"
fit <- process_results(
model_fit=em_fit,
election_config=config,
calc_ses=FALSE,
verbose=TRUE,
plots = TRUE,
save_results = FALSE,
fake_data=fake_data,
pause=FALSE
)
## [1] "predicting loess"
## [1] "plots"
## [1] "div_turnout"
## [1] "time_df"
## [1] "full_predictions"
print("Estimate:")
## [1] "Estimate:"
fit@full_predictions %>%
filter(time_of_day == max(time_of_day)) %>%
with(sum(prediction))
## [1] 106721.8
But we don’t want a single estimate; we want a bootstrap of estimates. This can take a few minutes, so I’ll do just 40 iterations for the demo:
source("bootstrap.R")
bs <- fit_and_bootstrap(
raw_data=raw_data,
params=params,
election_config=config,
n_boot=40,
use_inverse=FALSE,
verbose=TRUE
)
## [1] "Raw Result"
## [1] "n_iter = 72"
## [1] "Raw Result: 106721.798776348"
## [1] "Boot 1"
## [1] "n_iter = 91"
## [1] "1: 103783.409137528"
## [1] "Boot 2"
## [1] "n_iter = 100"
## [1] "2: 102437.339998497"
## [1] "Boot 3"
## [1] "n_iter = 105"
## [1] "3: 109170.262963837"
## [1] "Boot 4"
## [1] "n_iter = 103"
## [1] "4: 106618.415754892"
## [1] "Boot 5"
## [1] "n_iter = 83"
## [1] "5: 99929.889286891"
## [1] "Boot 6"
## [1] "n_iter = 100"
## [1] "6: 107673.380326182"
## [1] "Boot 7"
## [1] "n_iter = 116"
## [1] "7: 102493.628499471"
## [1] "Boot 8"
## [1] "n_iter = 107"
## [1] "8: 109623.259988905"
## [1] "Boot 9"
## [1] "n_iter = 96"
## [1] "9: 105539.19655515"
## [1] "Boot 10"
## [1] "n_iter = 118"
## [1] "10: 106381.731451132"
## [1] "Boot 11"
## [1] "n_iter = 97"
## [1] "11: 107158.6855022"
## [1] "Boot 12"
## [1] "n_iter = 109"
## [1] "12: 110782.154619117"
## [1] "Boot 13"
## [1] "n_iter = 139"
## [1] "13: 107730.203909652"
## [1] "Boot 14"
## [1] "n_iter = 104"
## [1] "14: 104152.097007173"
## [1] "Boot 15"
## [1] "n_iter = 121"
## [1] "15: 106920.113316562"
## [1] "Boot 16"
## [1] "n_iter = 104"
## [1] "16: 106163.455156913"
## [1] "Boot 17"
## [1] "n_iter = 84"
## [1] "17: 103292.158472231"
## [1] "Boot 18"
## [1] "n_iter = 94"
## [1] "18: 104530.583070934"
## [1] "Boot 19"
## [1] "n_iter = 96"
## [1] "19: 107184.002776702"
## [1] "Boot 20"
## [1] "n_iter = 102"
## [1] "20: 105188.19526865"
## [1] "Boot 21"
## [1] "n_iter = 116"
## [1] "21: 107150.963935033"
## [1] "Boot 22"
## [1] "n_iter = 98"
## [1] "22: 104521.910194006"
## [1] "Boot 23"
## [1] "n_iter = 102"
## [1] "23: 107913.906643297"
## [1] "Boot 24"
## [1] "n_iter = 97"
## [1] "24: 108867.124135728"
## [1] "Boot 25"
## [1] "n_iter = 102"
## [1] "25: 103793.935827009"
## [1] "Boot 26"
## [1] "n_iter = 101"
## [1] "26: 112376.185170527"
## [1] "Boot 27"
## [1] "n_iter = 137"
## [1] "27: 104190.071938615"
## [1] "Boot 28"
## [1] "n_iter = 91"
## [1] "28: 108547.047660375"
## [1] "Boot 29"
## [1] "n_iter = 135"
## [1] "29: 106536.699842896"
## [1] "Boot 30"
## [1] "n_iter = 111"
## [1] "30: 106967.004849557"
## [1] "Boot 31"
## [1] "n_iter = 112"
## [1] "31: 105352.580352336"
## [1] "Boot 32"
## [1] "n_iter = 101"
## [1] "32: 108380.356903598"
## [1] "Boot 33"
## [1] "n_iter = 88"
## [1] "33: 107151.752482793"
## [1] "Boot 34"
## [1] "n_iter = 100"
## [1] "34: 103363.369825211"
## [1] "Boot 35"
## [1] "n_iter = 94"
## [1] "35: 107556.57777492"
## [1] "Boot 36"
## [1] "n_iter = 88"
## [1] "36: 102367.702452195"
## [1] "Boot 37"
## [1] "n_iter = 104"
## [1] "37: 109824.813003086"
## [1] "Boot 38"
## [1] "n_iter = 103"
## [1] "38: 98437.7300599913"
## [1] "Boot 39"
## [1] "n_iter = 120"
## [1] "39: 108218.831273228"
## [1] "Boot 40"
## [1] "n_iter = 110"
## [1] "40: 106128.312037435"
## [1] "BS Turnout: 99893 106578 110822"
gg_bs_hist <- hist_bootstrap(bs)
print(gg_bs_hist)
gg_turnout <- turnout_plot(
bs,
config
)
print(gg_turnout)
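The "BS Turnout" line summarizes the bootstrap distribution with three percentiles. Given any vector of bootstrap totals, the interval is one line of base R; the 2.5/50/97.5 levels below are an assumption, and bootstrap.R may use different ones:

```r
set.seed(1)
# Stand-in for the 40 bootstrap totals printed above
boot_totals <- rnorm(40, mean = 106000, sd = 3000)

# Percentile interval plus the median, as in "BS Turnout: lo mid hi"
round(quantile(boot_totals, probs = c(0.025, 0.5, 0.975)))
```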
There are a few specifics that need to be handled on election day. The file run_all.R does five things: (a) downloads the Google data, (b) calculates the bootstrapped estimate, (c) compiles election_tracker.Rmd, which creates the markdown report, (d) pushes the html to GitHub using upload_git.bat, and (e) optionally tweets an update.
On election day, you just run run_all.R and let the script loop.
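Schematically, that loop amounts to repeating the five steps until polls close. The skeleton below is a hypothetical sketch, not run_all.R itself: it reuses the functions shown earlier, and the stopping condition, sleep interval, and tweet step are all assumptions.

```r
# Hypothetical skeleton of the election-day loop (assumes the project's
# scripts have been sourced, so load_data and fit_and_bootstrap exist).
while (as.numeric(format(Sys.time(), "%H")) <= config$end_hour) {
  raw_data <- load_data(use_google_data = TRUE, params = params,
                        election_config = config)$raw_data        # (a) download
  bs <- fit_and_bootstrap(raw_data = raw_data, params = params,
                          election_config = config, n_boot = 40)  # (b) estimate
  rmarkdown::render("election_tracker.Rmd")                       # (c) report
  system("upload_git.bat")                                        # (d) push html
  # (e) optionally tweet an update here
  Sys.sleep(5 * 60)  # wait five minutes before refreshing
}
```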